This is a notebook explaining the mathematics of Linear and Logistic Regression. It is adapted from my blog. All of the code is available on GitHub.

1. Regression Analysis

Regression Analysis is the field of mathematics where the goal is to find a function which best fits a given dataset. Let's say we have a dataset containing $ n $ datapoints:

$ X = ( x^{(1)}, x^{(2)}, \ldots, x^{(n)} ) $.

For each of these (input) datapoints there is a corresponding (output) value $ y^{(i)}$; together these form the dataset $ Y = ( y^{(1)}, y^{(2)}, \ldots, y^{(n)} ) $.

Here, the $ x$-datapoints are called the independent variables and $ y$ the dependent variable; the value of $ y^{(i)} $ depends on the value of $ x^{(i)} $, while the value of $ x^{(i)}$ may be freely chosen without any restriction imposed on it by any other variable.

The goal of Regression analysis is to find a function $ f(X) $ which can best describe the correlation between $ X $ and $ Y $. In the field of Machine Learning, this function is called the hypothesis function and is denoted as $ h_{\theta}(x) $. If we can find such a function, we can say we have successfully built a Regression model.


If the input-data lives in a 2D-space, this boils down to finding a curve which fits through the data points. In the 3D case we have to find a plane and in higher dimensions a hyperplane.

To give an example, let's say that we are trying to find a predictive model for the success of students in a course called Machine Learning. We have a dataset $ Y $ which contains the final grade of $ n $ students. Dataset $ X $ contains the values of the independent variables. Our initial assumption is that the final grade only depends on the studying time. The variable $ x^{(i)} $ therefore indicates how many hours student $ i $ has studied. The first thing we would do is visualize this data:

If the results look like the figure on the left, then we are out of luck: the points appear to be distributed randomly and there is no correlation between $ Y$ and $ X$ at all. However, if it looks like the figure on the right, there is probably a strong correlation and we can start looking for the function which describes this correlation.

 

This function could for example be:

$ h_{\theta}(X) =  \theta_0+ \theta_1 \cdot x $

or

$ h_{\theta}(X) = \theta_0 + \theta_1 \cdot x^2 $

where the $ \theta_j $ are the parameters of our model.


1.2 Multivariate Regression


In evaluating the results from the previous section, we may find them unsatisfying; the function does not correlate with the datapoints strongly enough. Our initial assumption is probably not complete: taking only the studying time into account is not enough. The final grade does not only depend on the studying time, but also on how much the students have slept the night before the exam. The dataset then contains an additional variable representing the sleeping time, and is given by $ X = ( (x_1^{(1)}, x_2^{(1)}), (x_1^{(2)}, x_2^{(2)}), \ldots, (x_1^{(n)}, x_2^{(n)}) ) $. In this dataset $ x_1^{(i)} $ indicates how many hours student $ i $ has studied and $ x_2^{(i)} $ indicates how many hours he has slept. This is an example of [Multivariate Regression](https://en.wikipedia.org/wiki/Linear_regression#Simple_and_multiple_regression). The hypothesis function has to include both variables. For example:

$ h_{\theta}(x) = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2$

or

$ h_{\theta}(x) = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2^3 $.
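As a tiny sketch (the numbers are made up for this illustration), the multivariate hypothesis is just a dot product once a bias term is prepended to the feature vector:

import numpy as np

theta = np.array([0.5, 1.2, 0.8])  # theta_0, theta_1, theta_2 (made-up values)
x = np.array([1.0, 6.0, 7.5])      # bias term, hours studied, hours slept
h = np.dot(theta, x)               # h_theta(x) = theta_0 + theta_1*x_1 + theta_2*x_2
print(h)                           # 0.5 + 1.2*6.0 + 0.8*7.5 = 13.7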

1.3 Linear vs Non-Linear Regression


All of the above examples are examples of linear regression. We have seen that in some cases $ y^{(i)} $ depends on a linear form of $ x^{(i)} $, but it can also depend on some power of $ x^{(i)}$, or on the log or any other form of $ x^{(i)}$. However, in all cases its dependence on the parameters $\theta$ was linear.

So, what makes linear regression linear is not that $ Y$ depends in a linear way on $ X $, but that it depends in a linear way on $ \theta$. $ Y $ needs to be linear with respect to the model-parameters $ \theta $. Mathematically speaking it needs to satisfy the superposition principle.  

Examples of nonlinear regression would be:

$ h_{\theta}(x) = \theta_0 + x_1^{\theta_1} $

or

$ h_{\theta}(x) = \theta_0 + \theta_1 / x_1 $


The reason why the distinction is made between linear and nonlinear regression is that nonlinear regression problems are more difficult to solve and therefore require more computationally intensive algorithms. Linear regression models can be written as a linear system of equations, which can be solved by finding the closed-form solution $ \theta = ( X^TX )^{-1}X^TY $ with Linear Algebra. See these statistics notes for more on solving linear models with linear algebra. As discussed before, such a closed-form solution can only be found for linear regression problems. However, even when the problem is linear in nature, we need to take into account that computing the inverse of an $ n $ by $ n $ matrix has a time complexity of $ O(n^3) $. This means that for large datasets ($ n \gt 10{,}000 $) finding the closed-form solution can take more time than solving the problem iteratively (with the gradient descent method), as is done for nonlinear problems. Solving it iteratively is therefore usually preferred for larger datasets, even if the problem is linear.
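As a small illustration (this snippet is not part of the original notebook), here is a minimal NumPy sketch of the closed-form solution on made-up data; the variable names and the toy dataset are chosen just for this example:

import numpy as np

# Made-up data: y = 2 + 3*x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(scale=0.5, size=100)

# Add a column of ones so that theta_0 acts as the intercept
X_b = np.column_stack([np.ones_like(x), x])

# Closed-form solution: theta = (X^T X)^{-1} X^T Y
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta)  # should be close to [2, 3]

In practice one would rather use np.linalg.lstsq (or np.linalg.solve on the normal equations), which is numerically more stable than explicitly inverting $ X^TX $.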

1.4 Gradient Descent


The Gradient Descent method is a general optimization technique in which we try to find the values of the parameters $ \theta $ with an iterative approach. First, we construct a cost function (also known as a loss function or error function) which measures the difference between the values of the hypothesis function $h_{\theta}(x)$ (the values you expect $ Y$ to have with the current values of $ \theta$) and the actual values of $ Y $. The better your estimate of $ \theta $ is, the closer the values of $ h_{\theta}(x) $ will be to the values of $ Y$ and the smaller the cost function will be. Usually, the cost function is expressed as the mean of the squared differences between these values (with a factor $\frac{1}{2}$ added for convenience when taking derivatives):

$ J(\theta) = \frac{1}{2n} \sum_{i=1}^n \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2 $


At each iteration we choose new values for the parameters $ \theta$, and move towards the 'true' values of these parameters, i.e. the values which make this cost function as small as possible. The direction in which we have to move is the negative gradient direction:

$ \Delta\theta = - \alpha \frac{d}{d\theta} J(\theta) $.

The reason for this is that a function's value decreases fastest if we move in the direction of the negative gradient (the directional derivative is maximal in the direction of the gradient).

Taking all this into account, this is how gradient descent works:

  • Make an initial but intelligent guess for the values of the parameters $ \theta $.
  • Keep iterating while the value of the cost function has not met your criteria:
    • With the current values of $ \theta $, calculate the gradient of the cost function $ J $ and the corresponding step $ \Delta \theta = - \alpha \frac{d}{d\theta} J(\theta) $.
    • Update the values of the parameters: $ \theta := \theta + \Delta \theta $.
    • Fill in these new values in the hypothesis function and calculate the value of the cost function again.

Just as important as the initial guess of the parameters is the value you choose for the learning rate $ \alpha $. This learning rate determines how fast you move along the slope of the gradient. If the selected value of this learning rate is too small, it will take too many iterations before you reach your convergence criteria. If this value is too large, you might overshoot and not converge.
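To make this procedure concrete, here is a minimal sketch (not part of the original notebook) of gradient descent for the linear hypothesis $ h_{\theta}(x) = \theta_0 + \theta_1 \cdot x $ on made-up data; the learning rate and the number of iterations are arbitrary choices for this example:

import numpy as np

# Made-up data: y = 2 + 3*x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(scale=0.5, size=100)
X_b = np.column_stack([np.ones_like(x), x])  # bias column + feature
n = len(y)

theta = np.zeros(2)  # initial guess
alpha = 0.01         # learning rate

for iteration in range(5000):
    # gradient of J(theta) = 1/(2n) * sum( (h_theta(x_i) - y_i)^2 )
    gradient = (1.0 / n) * X_b.T @ (X_b @ theta - y)
    theta = theta - alpha * gradient  # step in the negative gradient direction

print(theta)  # moves towards [2, 3]

In a full implementation one would also monitor the cost at each iteration and stop once it no longer decreases significantly, as described in the steps above.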

2. Logistic Regression


Logistic Regression is similar to (linear) regression, but adapted for the purpose of classification. The difference is small; for Logistic Regression we also have to iteratively estimate the values of the parameters $ \theta$ with the gradient descent method, and again, during the iteration, the values are updated using the gradient of a cost function which measures how far the hypothesis function $ h_{\theta}(x)$ is from $ Y$ (although, as we will see, this cost function is the cross-entropy rather than the squared error). The major difference, however, is the hypothesis function itself.

In order to understand the hypothesis function of Logistic Regression, we must first understand the idea behind classification.

When you want to classify something, there are a limited number of classes it can belong to. And for each of these possible classes there can only be two states for $ y^{(i)} $: either the datapoint belongs to the specified class and $ y^{(i)} = 1 $, or it does not belong to the class and $ y^{(i)} = 0 $. Even though the output values $ Y $ are binary, the independent variables $ X $ are still continuous. So, we need a function which takes a large set of continuous variables $ X $ as input and produces a binary output for each datapoint. This function, the hypothesis function, has the following form:

$ h_{\theta}(x) = \frac{1}{1 + \exp(-z)} = \frac{1}{1 + \exp(-\theta x)} $.

This function is also known as the logistic function, which is part of the sigmoid family of functions. These functions are widely used in the natural sciences because they provide a simple model for population growth. The reason why the logistic function is used for classification in Machine Learning, however, is its 'S'-shape.

As you can see, this function is bounded in the $ y$-direction by 0 and 1. If the variable $ z$ is very negative, the output goes to zero (the example does not belong to the class). If the variable $ z$ is very positive, the output goes to one and it does belong to the class. (In these limits the logistic function behaves like an indicator function.)

The question then is what happens to input values which are neither very positive nor very negative, but somewhere 'in the middle'. We have to define a decision boundary, which separates the positive from the negative class. Usually this decision boundary is chosen at the middle of the logistic function, namely at $ z = 0 $, where the output value $ y$ is $ 0.5$.

\begin{equation} y =\begin{cases} 1, \text{if $z \gt 0 $}.\\ 0, \text{if $z \lt 0 $}. \end{cases} \end{equation}

As we can see in the formula of the logistic function, $ z = \theta \cdot x $. That is, the parameter $ \theta$ maps the input variable $ x$ (the feature) to a position on the $ z$-axis. With this $ z$-value we can use the logistic function to calculate the $ y$-value. If this $ y$-value is $ \gt 0.5 $ we assume the example does belong to the class, and vice versa.
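As a small numerical illustration (the $ z $-values below are made up for this sketch), this is how $ z $-values are mapped to probabilities and then to class labels:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(sigmoid(z))        # approximately [0.007 0.378 0.5 0.622 0.993]
print(sigmoid(z) > 0.5)  # [False False False  True  True]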

So the parameters $ \theta $ should be chosen such that the model predicts class membership correctly. It is therefore essential to know which features are useful for the classification task. Once the appropriate features are selected, gradient descent can be used to find the optimal values of the parameters.

How can we do gradient descent with this logistic function? Except for the hypothesis function having a different form, the gradient descent method is exactly the same. We again have a cost function, of which we iteratively take the gradient w.r.t. the parameters $ \theta $ and update the parameter values at each iteration.

This cost function is the negative log-likelihood, also known as the binary cross-entropy:

 

\begin{equation} \begin{split} J(\theta) = -\frac{1}{2n} \sum_{i=1}^n \left(  y^{(i)} \log( h_{\theta}(x^{(i)})) + (1-y^{(i)})\log(1-h_{\theta}(x^{(i)})) \right) \\ = -\frac{1}{2n} \sum_{i=1}^n \left( y^{(i)} \log\left(\frac{1}{1+\exp(-\theta x^{(i)})}\right) + (1-y^{(i)})\log\left(1-\frac{1}{1+\exp(-\theta x^{(i)})}\right) \right) \end{split} \end{equation}

 

We know that:
$ \log\left(\frac{1}{1+\exp(-\theta x^{(i)})}\right) = \log(1) - \log(1+\exp(-\theta x^{(i)})) = - \log(1+\exp(-\theta x^{(i)}))$

and

$ \log\left(1-\frac{1}{1+\exp(-\theta x^{(i)})}\right) = \log\left( \frac{\exp(-\theta x^{(i)})}{1+\exp(-\theta x^{(i)})}\right) $

$ = \log(\exp(-\theta x^{(i)})) - \log(1+\exp(-\theta x^{(i)})) $

$ = -\theta x^{(i)} - \log(1+\exp(-\theta x^{(i)})) $

Plugging these two equations back into the cost function gives us:

\begin{equation} \begin{split} J(\theta) = - \frac{1}{2n} \sum_{i=1}^n \left( - y^{(i)} \log(1+\exp(-\theta x^{(i)})) - (1-y^{(i)})\left(\theta x^{(i)} + \log(1+\exp(-\theta x^{(i)}))\right) \right) \\ = - \frac{1}{2n} \sum_{i=1}^n \left(  y^{(i)} \theta x^{(i)} - \theta x^{(i)} - \log(1+\exp(-\theta x^{(i)})) \right) \end{split} \end{equation}

 

The gradient of the cost function with respect to $ \theta $ is given by

\begin{align} \frac{d}{d\theta} J(\theta) = - \frac{1}{2n} \sum_{i=1}^n \left(  y^{(i)} x^{(i)} - x^{(i)} + x^{(i)} \frac{\exp(-\theta x^{(i)})}{1+\exp(-\theta x^{(i)})} \right) \\ = - \frac{1}{2n} \sum_{i=1}^n x^{(i)} \left( y^{(i)} - 1 + \frac{\exp(-\theta x^{(i)})}{1+\exp(-\theta x^{(i)})} \right) \\ = - \frac{1}{2n} \sum_{i=1}^n x^{(i)} \left( y^{(i)} - \frac{1}{1+\exp(-\theta x^{(i)})} \right) \\ = - \frac{1}{2n} \sum_{i=1}^n x^{(i)} \left( y^{(i)} - h_{\theta}(x^{(i)}) \right) \end{align}

 

So the gradient of the seemingly complicated cost function turns out to be a much simpler expression. And since it is this gradient that we use to update the values of $\theta$, this makes our work much easier.
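As a sanity check on this derivation (this snippet is not part of the original text), we can compare the analytic gradient with a finite-difference approximation of the cost function on made-up one-dimensional data; the factor $ \frac{1}{2n} $ matches the formulas above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, x, y):
    # J(theta) = -1/(2n) * sum( y*log(h) + (1-y)*log(1-h) )
    h = sigmoid(theta * x)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)) / 2.0

def gradient(theta, x, y):
    # d/dtheta J(theta) = -1/(2n) * sum( x * (y - h) )
    h = sigmoid(theta * x)
    return -np.mean(x * (y - h)) / 2.0

# Made-up one-dimensional data
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = (x > 0).astype(float)

theta, eps = 0.3, 1e-6
numeric = (cost(theta + eps, x, y) - cost(theta - eps, x, y)) / (2 * eps)
print(gradient(theta, x, y), numeric)  # the two values should agree closely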

Gradient descent for Logistic Regression is again performed in the same way:

  • Make an initial but intelligent guess for the values of the parameters $ \theta $.
  • Keep iterating while the value of the cost function has not met your criteria:
    • With the current values of $ \theta $, calculate the gradient of the cost function $ J $ and the corresponding step $ \Delta \theta = - \alpha \frac{d}{d\theta} J(\theta) $.
    • Update the values of the parameters: $ \theta := \theta + \Delta \theta $.
    • Fill in these new values in the hypothesis function and calculate the value of the cost function again.

2.2 Logistic Regression - Example 1


Now that we have looked at the theory, let's have a look at an example of Logistic Regression. We will start with the example of students passing a course or not.

Let's generate some data points. There are $n = 300$ students participating in the course Machine Learning, and whether a student $ i $ passes ($ y^{(i)} = 1 $) or not ($ y^{(i)} = 0$) depends on two variables:

  • $x_1^{(i)}$ : how many hours student $ i $ has studied for the exam.
  • $x_2^{(i)}$ : how many hours student $ i $ has slept the day before the exam.
 

In our example, the results are pretty binary; everyone who has studied less than 4 hours fails the course, as well as everyone whose studying time + sleeping time is less than or equal to 13 hours ($x_1^{(i)} + x_2^{(i)} \leq 13$).


In [2]:
import random
import numpy as np

def func2(x_i):
    # A student passes (1) only if he has studied at least 4 hours
    # and studying time + sleeping time exceeds 13 hours; otherwise he fails (0).
    if x_i[1] < 4 or x_i[1] + x_i[2] <= 13:
        return 0
    return 1

def generate_data2(no_points):
    X = np.zeros(shape=(no_points, 3))
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii][0] = 1                        # bias term
        X[ii][1] = random.random()*9 + 0.5  # studying time: between 0.5 and 9.5 hours
        X[ii][2] = random.random()*9 + 0.5  # sleeping time: between 0.5 and 9.5 hours
        Y[ii] = func2(X[ii])
    return X, Y

X, Y = generate_data2(300)

The result looks like this (the green dots indicate a pass and the red dots a fail):



We have a LogisticRegression class, which sets the values of the learning rate $\alpha$ and the maximum number of iterations max_iter at its initialization.
The values of X, Y are set when these matrices are passed to the “train()” function, and then the values of no_examples, no_features, and theta are determined.

We also have the hypothesis, cost and gradient functions.

The gradient descent method uses these methods to update the values of $\theta$.

The “train()” method first sets the values of the matrices X, Y and theta, and then calls the gradient_descent method.

Once the $\theta$ values have been determined with the gradient descent method, the training of the classifier is complete and we can use it to classify new examples.


In [5]:
import numpy as np

class LogisticRegression():
    """
    Class for performing logistic regression.
    """
    def __init__(self, learning_rate = 0.7, max_iter = 1000):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.theta = []
        self.no_examples = 0
        self.no_features = 0
        self.X = None
        self.Y = None
        
    def add_bias_col(self, X):
        bias_col = np.ones((X.shape[0], 1))
        return np.concatenate([bias_col, X], axis=1)
              
    def hypothesis(self, X):
        return 1 / (1 + np.exp(-1.0 * np.dot(X, self.theta)))

    def cost_function(self):
        """
        We will use the binary cross entropy as the cost function. https://en.wikipedia.org/wiki/Cross_entropy
        """
        predicted_Y_values = self.hypothesis(self.X)
        cost = (-1.0/self.no_examples) * np.sum(self.Y * np.log(predicted_Y_values) + (1 - self.Y) * (np.log(1-predicted_Y_values)))
        return cost
        
    def gradient(self):
        predicted_Y_values = self.hypothesis(self.X)
        grad = (-1.0/self.no_examples) * np.dot((self.Y-predicted_Y_values), self.X)
        return grad
        
    def gradient_descent(self):
        for iter in range(1,self.max_iter):
            cost = self.cost_function()
            delta = self.gradient()
            self.theta = self.theta - self.learning_rate * delta
            if iter % 100 == 0: print("iteration %s : cost %s " % (iter, cost))
        
    def train(self, X, Y):
        self.X = self.add_bias_col(X)
        self.Y = Y
        self.no_examples, self.no_features = np.shape(X)
        self.theta = np.ones(self.no_features + 1)
        self.gradient_descent()
  
    def classify(self, X):
        X = self.add_bias_col(X)
        predicted_Y = self.hypothesis(X)
        predicted_Y_binary = np.round(predicted_Y)
        return predicted_Y_binary

2.3 Logistic Regression - Example 2: Iris Dataset


Now that the concept of Logistic Regression is a bit more clear, let's classify some real-world data! One of the most famous classification datasets is the Iris Flower Dataset. This dataset consists of three classes, where each example has four numerical features:


In [7]:
import pandas as pd
from evaluators import *
 
to_bin_y = { 1: { 'Iris-setosa': 1, 'Iris-versicolor': 0, 'Iris-virginica': 0 },
             2: { 'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 0 },
             3: { 'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 1 }
             }
 
#loading the dataset
datafile = '../datasets/iris/iris.data'
df = pd.read_csv(datafile, header=None)
df_train = df.sample(frac=0.7)
df_test = df.loc[~df.index.isin(df_train.index)]
X_train = df_train.values[:,0:4].astype(float)
y_train = df_train.values[:,4]
X_test = df_test.values[:,0:4].astype(float)
y_test = df_test.values[:,4]
 
Y_train = np.array([to_bin_y[3][x] for x in y_train])
Y_test = np.array([to_bin_y[3][x] for x in y_test])
 
print("training Logistic Regression Classifier")
lr = LogisticRegression()
lr.train(X_train, Y_train)
print("trained")
predicted_Y_test = lr.classify(X_test)
f1 = f1_score(predicted_Y_test, Y_test, 1)
print("F1-score on the test-set for class %s is: %s" % (1, f1))


training Logistic Regression Classifier
iteration 100 : cost 0.617481190363 
iteration 200 : cost 0.0867172909824 
iteration 300 : cost 0.0849155519757 
iteration 400 : cost 0.083336248398 
iteration 500 : cost 0.0819359612358 
iteration 600 : cost 0.080681546988 
iteration 700 : cost 0.0795474613777 
iteration 800 : cost 0.078513817168 
iteration 900 : cost 0.0775649711536 
trained
F1-score on the test-set for class 1 is: 0.96

As you can see, our simple LogisticRegression class can classify the Iris dataset with quite a high F1-score.

